World Bank Indicators
Preface
This project is dedicated to finding patterns in the structure of countries that are similar to each other, regarding their GDP and indicators to that. To get these insights a cluster analysis will be performed using kmeans and agglomerative hierarchical clustering. Based on the best fitting clustering results a linear regression will be performed for each cluster to see the drivers of the GDP and if they differ between the clusters. For further analysis of that, a principal component analysis can be used, yet I chose to start off with a linear regression.
The Data
The data analyzed here can be downloaded from the World Bank, that is an open database, which offers access to global development data. The used excerpt includes the Gross Domestic Product and possible indicators to that for over 200 countries between the years 2000 and 2010. The columns are explained as follows:
- Country: The name of the country that is looked at.
- Date: The year that is looked at.
- Train_Mio_Passenger_KM: Million passenger-kilometers, the number of passengers transportet by rail times the kilometers travelled.
- Cars_Per_1000_People: Number of cars per 1000 inhabitants.
- Mobile_Phone_Subscribers: Number of inhabitants subscribed to a public mobile telephone service.
- Internet_Users_Per_100_People: Number of internet users per 100 inhabitants.
- Mortality_Under_5_Per_1000_Live_Births: The number of children dying below the age of 5 per 1000 live births.
- Health_Expenditure_Per_Capita_In_USD: Expenditures on health in US Dollar per capita, including consumed healthcare goods and services.
- Total_Health_Expenditure_In_Percent_Of_GDP: Total expenditures on health in Percent of the GDP, including consumed healthcare goods and services.
- Total_Population: Total number of inhabitants.
- Urban_Population:Total number of inhabitants livin in urban areas.
- Birth_Rate_Per_1000_People: Number of live births per 1000 inhabitants.
- Life_Expectancy_Women: Life expectancy of women.
- Life_Expectancy_Men: Life expectancy of men.
- Total_Life_Expectancy: Combined life expectancy of women and men.
- Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population: Percentage of people aged between 0 and 14.
- Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population: Percentage of people aged between 15 and 64.
- Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population: Percentage of people aged 65 or older.
- Total_GDP_In_USD: Total Gross Domestic Product in US Dollar.
- GDP_Per_Capita_In_USD: Gross Domestic Product per capita in US Dollar.
Preprocessing
At first, the necessary packages and the data must be loaded.
library(readxl)
library(tidyverse)
library(factoextra)
library(plotly)
library(lubridate)
library(maps)
library(sf)
library(lwgeom)
library(gridExtra)
library(ggsci)
df <- read_excel("World Bank Indicators.xlsx")
theme_set(theme_classic())Before inspecting the data further, the columns receive shorter or more concise names and get cleaned off special characters.
names(df) <- str_replace_all(names(df), c(" " = "_",
"," = "",
":" = "",
"[(]" = "",
"[)]" = "",
"-" = "_",
"[\\$]" = "Dollar",
"%" = "Percent",
"[\\+]" = "plus"))
names(df) <- str_trim(names(df), "left")
df <- df %>%
rename(Country = Country_Name)
df <- df %>%
rename(Train_Mio_Passenger_KM = Transit_Railways_million_passenger_km,
Cars_Per_1000_People = Transit_Passenger_cars_per_1000_people)
df <- df %>%
rename(Mobile_Phone_Subscribers = Business_Mobile_phone_subscribers,
Internet_Users_Per_100_People = Business_Internet_users_per_100_people)
df <- df %>%
rename(Mortality_Under_5_Per_1000_Live_Births = Health_Mortality_under_5_per_1000_live_births,
Health_Expenditure_Per_Capita_In_USD = Health_Health_expenditure_per_capita_current_USDollar,
Total_Health_Expenditure_In_Percent_Of_GDP = Health_Health_expenditure_total_Percent_GDP)
df <- df %>%
rename(Total_Population = Population_Total_count,
Urban_Population = Population_Urban_count,
Birth_Rate_Per_1000_People = Population_Birth_rate_crude_per_1000)
df <- df %>%
rename(Life_Expectancy_Women = Health_Life_expectancy_at_birth_female_years,
Life_Expectancy_Men = Health_Life_expectancy_at_birth_male_years,
Total_Life_Expectancy = Health_Life_expectancy_at_birth_total_years)
df <- df %>%
rename(Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population = Population_Ages_0_14_Percent_of_total,
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population = Population_Ages_15_64_Percent_of_total,
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population = Population_Ages_65plus_Percent_of_total)
df <- df %>%
rename(Total_GDP_In_USD = Finance_GDP_current_USDollar,
GDP_Per_Capita_In_USD = Finance_GDP_per_capita_current_USDollar)After that a first inspection of the data follows.
Classes 'tbl_df', 'tbl' and 'data.frame': 2354 obs. of 20 variables:
$ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Date : POSIXct, format: "2000-07-01" "2001-07-01" ...
$ Train_Mio_Passenger_KM : num 0 0 0 0 0 0 0 0 0 0 ...
$ Cars_Per_1000_People : num NA NA NA NA NA NA 11 18 19 21 ...
$ Mobile_Phone_Subscribers : num 0 0 25000 200000 600000 ...
$ Internet_Users_Per_100_People : num NA 0 0 0 0 1 2 2 2 3 ...
$ Mortality_Under_5_Per_1000_Live_Births : num 151 150 150 151 150 151 151 150 150 149 ...
$ Health_Expenditure_Per_Capita_In_USD : num 11 11 22 25 30 33 24 29 32 34 ...
$ Total_Health_Expenditure_In_Percent_Of_GDP : num 8 9 7 8 9 9 7 7 7 8 ...
$ Total_Population : num 25950816 26697430 27465525 28255719 29068646 ...
$ Urban_Population : num 5527524 5771984 6025936 6289723 6563700 ...
$ Birth_Rate_Per_1000_People : num 51 50 49 48 47 47 46 45 45 44 ...
$ Life_Expectancy_Women : num 45 46 46 46 46 47 47 47 48 48 ...
$ Life_Expectancy_Men : num 45 45 46 46 46 47 47 47 47 48 ...
$ Total_Life_Expectancy : num 45 46 46 46 46 47 47 47 48 48 ...
$ Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population : num 48 48 48 48 48 48 48 47 47 47 ...
$ Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population: num 50 50 50 50 50 50 50 50 51 51 ...
$ Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population : num 2 2 2 2 2 2 2 2 2 2 ...
$ Total_GDP_In_USD : num NA 2.46e+09 4.34e+09 4.77e+09 5.70e+09 ...
$ GDP_Per_Capita_In_USD : num NA 92 158 169 196 228 251 307 367 425 ...
Country Date Train_Mio_Passenger_KM
Length:2354 Min. :2000-07-01 00:00:00 Min. : 0
Class :character 1st Qu.:2002-07-01 00:00:00 1st Qu.: 334
Mode :character Median :2005-07-01 00:00:00 Median : 1994
Cars_Per_1000_People Mobile_Phone_Subscribers
Min. : 0.0 Min. : 0
1st Qu.: 31.0 1st Qu.: 129093
Median :118.0 Median : 1200000
Internet_Users_Per_100_People Mortality_Under_5_Per_1000_Live_Births
Min. : 0.00 Min. : 2.00
1st Qu.: 2.00 1st Qu.: 10.00
Median :10.00 Median : 26.00
Health_Expenditure_Per_Capita_In_USD
Min. : 3.0
1st Qu.: 44.0
Median : 181.0
Total_Health_Expenditure_In_Percent_Of_GDP Total_Population
Min. : 0.000 Min. :9.419e+03
1st Qu.: 5.000 1st Qu.:7.140e+05
Median : 6.000 Median :5.394e+06
Urban_Population Birth_Rate_Per_1000_People Life_Expectancy_Women
Min. : 4333 Min. : 7.00 Min. :41.0
1st Qu.: 385519 1st Qu.:13.00 1st Qu.:63.0
Median : 2843030 Median :20.00 Median :74.0
Life_Expectancy_Men Total_Life_Expectancy
Min. :39.00 Min. :40.00
1st Qu.:60.00 1st Qu.:62.00
Median :69.00 Median :72.00
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population
Min. :12.00
1st Qu.:20.50
Median :30.00
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population
Min. :48.00
1st Qu.:56.00
Median :64.00
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population
Min. : 0.000
1st Qu.: 3.000
Median : 5.000
Total_GDP_In_USD GDP_Per_Capita_In_USD
Min. :1.366e+07 Min. : 87
1st Qu.:2.994e+09 1st Qu.: 904
Median :1.255e+10 Median : 3257
[ erreichte getOption("max.print") -- 4 Zeilen ausgelassen ]
# A tibble: 0 x 20
# ... with 20 variables: Country <chr>, Date <dttm>,
# Train_Mio_Passenger_KM <dbl>, Cars_Per_1000_People <dbl>,
# Mobile_Phone_Subscribers <dbl>, Internet_Users_Per_100_People <dbl>,
# Mortality_Under_5_Per_1000_Live_Births <dbl>,
# Health_Expenditure_Per_Capita_In_USD <dbl>,
# Total_Health_Expenditure_In_Percent_Of_GDP <dbl>,
# Total_Population <dbl>, Urban_Population <dbl>,
# Birth_Rate_Per_1000_People <dbl>, Life_Expectancy_Women <dbl>,
# Life_Expectancy_Men <dbl>, Total_Life_Expectancy <dbl>,
# Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population <dbl>,
# Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population <dbl>,
# Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population <dbl>,
# Total_GDP_In_USD <dbl>, GDP_Per_Capita_In_USD <dbl>
It shows that there are a lot of missing values. There are different ways to handle this. At first I tried to omit all incomplete cases. Because of the huge loss of information, as there are only a few hundred complete cases starting in the year 2003, I chose a different option. To come to a solid conclusion I replaced all missing values once with 0 and once with the median of a variable per year and did clustering with both cases. I then chose to replace the NAs with 0, to have a low distortion. Many missing values occur in very small countries or in countries with little to no infrastructure. Choosing the median led to a higher distortion than using 0. Though it later affects the linear regression analysis I kept my choice, as the wrong classification of countries weighted higher in my eyes.
[1] 6340
[1] 0
Cluster Analysis
Preparations
To avoid a bias created by different units of measurement, the numeric variables are standardized, resp. centered and scaled. There are other possibilities, such as normalization (MinMax), but the used method in class has been standardization, so it is used here, too.
Clustering Algorithms
K-Means
The K-Means algorithm is used with 25 starts and 50 iterations, as more starts did not give better results and no convergence is achieved before 50 iterations. The Lloyd algorithm is used to have a random first assignment to clusters.
HClust
The agglomerative hierarchical clustering algorithm hclust is used with the function hcut for more flexible hyperparameter tuning. There are other possibilities for clustering functions that hclust, for example agnes or diana from cluster package, but the normal function is used here. In the hcut function it can be specified that the data is not a dissimilarity matrix, which is then calculated automatically. Euclidean distance is used a a distance measure. Clustering happens also for all four linkage methods discussed in class (complete, single, average, centroid).
Number Of Clusters
To determine the best fitting number of clusters I used both, the elbow and silhouette method for the k-means and hclust algorithm.
p <- df_scaled %>%
select(-Country, -Date) %>%
fviz_nbclust(kmeans, method = "wss", nstart = 25, iter.max = 50)
p1 <- df_scaled %>%
select(-Country,-Date) %>%
fviz_nbclust(kmeans, method = "silhouette", nstart = 25, iter.max = 50)
p2 <- df_scaled %>%
select(-Country,-Date) %>%
fviz_nbclust(hcut, method = "wss")
p3 <- df_scaled %>%
select(-Country,-Date) %>%
fviz_nbclust(hcut, method = "silhouette")
grid.arrange(p, p1, p2, p3, nrow = 2, ncol = 2)Both methods return the same recommendation for each algorithm. Though both methods recommend a different number for k. The silhouette methods recommends two clusters, whereas the elbow method suggests four to five clusters. I will perform the cluster analysis with 3, 4 and 5 k. 4 and 5 k, because the elbow method suggests that and 3 k, because we divide the world into first, second and third world countries all the time and thus clusters may have an easy to perceive meaning.
Clustering with K-Means
Now I perform the actual clustering with the k-means algorithm and the different ks and compare the results visually.
km_3 <- kmeans(df_scaled[-c(1,2)], centers = 3, nstart = 25, iter.max = 50, algorithm = "Lloyd")
p4 <- fviz_cluster(km_3, data = df_scaled[-c(1,2)], geom = "point", show.clust.cent = F, ggtheme = theme_classic(), palette = "rickandmorty", repel = T)
p5 <- p4 +
geom_point(aes(text = paste0(df_scaled$Country, ":", year(df$Date)), colour = as_factor(km_3$cluster)), size = 0.2)
km_4 <- kmeans(df_scaled[-c(1,2)], centers = 4, nstart = 25, iter.max = 50, algorithm = "Lloyd")
p6 <- fviz_cluster(km_4, data = df_scaled[-c(1,2)], geom = "point", show.clust.cent = F, ggtheme = theme_classic(), palette = "rickandmorty", repel = T)
p7 <- p6 +
geom_point(aes(text = paste0(df_scaled$Country, ":", year(df$Date)), colour = as_factor(km_4$cluster)), size = 0.2)
km_5 <- kmeans(df_scaled[-c(1,2)], centers = 5, nstart = 25, iter.max = 50, algorithm = "Lloyd")
p8 <- fviz_cluster(km_5, data = df_scaled[-c(1,2)], geom = "point", show.clust.cent = F, ggtheme = theme_classic(), palette = "rickandmorty", repel = T)
p9 <- p8 +
geom_point(aes(text = paste0(df_scaled$Country, ":", year(df$Date)), colour = as_factor(km_5$cluster)), size = 0.2)
subplot(p5, p7, p9, nrows = 3, shareX = T, shareY = T)Looking at the different solutions, the four cluster solution looks best to me until now. Three clusters seem to be too less and in the five cluster solution, there is a lot of overlapping in the three clusters in the middle.
Clustering with HClust
Now I will perform agglomerative hierarchical clustering, to see if there are better fitting clusters than with k-means. I will compare the clusters visually again. Before doing that, I will compare the different dendograms for each number of clusters and linkage method, to see which linkage performs best.
hc_complete <- hcut(df_scaled[-c(1,2)], isdiss = F, hc_func = "hclust", hc_metric = "euclidean",
hc_method = "complete")
p10 <- fviz_dend(hc_complete, k_colors = "black", show_labels = F, main = "Complete Linkage")
hc_single <- hcut(df_scaled[-c(1,2)], isdiss = F, hc_func = "hclust", hc_metric = "euclidean", hc_method = "single")
p11 <- fviz_dend(hc_single, k_colors = "black", show_labels = F, main = "Single Linkage")
hc_average <- hcut(df_scaled[-c(1,2)], isdiss = F, hc_func = "hclust", hc_metric = "euclidean",
hc_method = "average")
p12 <- fviz_dend(hc_average, k_colors = "black", show_labels = F, main = "Average Linkage")
hc_centroid <- hcut(df_scaled[-c(1,2)], isdiss = F, hc_func = "hclust", hc_metric = "euclidean",
hc_method = "centroid")
p13 <- fviz_dend(hc_centroid, k_colors = "black", show_labels = F, main = "Centroid Linkage")
grid.arrange(p10, p11, p12, p13, nrow = 2, ncol = 2)Looking at the dendograms, the complete linkage gives the best fitting results as the clusters are even. Especially the centroid linkage has a messed up assignment to clusters. Looking at a different number of clusters, the centroid linkage will be used.
In the next step, the different numbers of clusters will be calculated and visualized as before with k-means.
hc_3 <- hcut(df_scaled[-c(1,2)], k = 3, isdiss = F, hc_func = "hclust", hc_metric = "euclidean",
hc_method = "complete")
p14 <- fviz_cluster(hc_3, data = df_scaled[-c(1,2)], geom = "point", show.clust.cent = F, ggtheme = theme_classic(), palette = "rickandmorty", repel = T)
p15 <- p14 + geom_point(aes(text = paste0(df_scaled$Country, ":", year(df$Date)), colour = as_factor(hc_3$cluster)), size = 0.2)
hc_4 <- hcut(df_scaled[-c(1,2)], k = 4, isdiss = F, hc_func = "hclust", hc_metric = "euclidean",
hc_method = "complete")
p16 <- fviz_cluster(hc_4, data = df_scaled[-c(1,2)], geom = "point", show.clust.cent = F, ggtheme = theme_classic(), palette = "rickandmorty", repel = T)
p17 <- p16 + geom_point(aes(text = paste0(df_scaled$Country, ":", year(df$Date)), colour = as_factor(hc_4$cluster)), size = 0.2)
hc_5 <- hcut(df_scaled[-c(1,2)], k = 5, isdiss = F, hc_func = "hclust", hc_metric = "euclidean",
hc_method = "complete")
p18 <- fviz_cluster(hc_5, data = df_scaled[-c(1,2)], geom = "point", show.clust.cent = F, ggtheme = theme_classic(), palette = "rickandmorty", repel = T)
p19 <- p18 + geom_point(aes(text = paste0(df_scaled$Country, ":", year(df$Date)), colour = as_factor(hc_5$cluster)), size = 0.2)
subplot(p15, p17, p19, nrows = 3, shareX = T, shareY = T)As we can see here, the cluster assignment is not that useful in my opinion. There is one giant cluster and two/three/four very small clusters with, you could say outliers, like China or India, that form their own cluster. Comparing these with the result from k-means clustering, the four clusters generated there seem to fit the data best.
To me, the k-means solution with four clusters seems to be the most reasonable solution for a classification of the different countries in different years. So I chose to further work with that solution.
I then tried to come up with meaningful names for each cluster, but discarded the upcoming ideas, as I couldn’t find names that are conclusive enough. Yet, I want to give some information for each cluster.
Cluster 1: The smallest cluster, containing only data from India, China and the USA. To me, that seemed quite reasonable, as all these countries have a high population and a great territory. The whole structure of India and China resembles each other, e.g. life expectancy, total population or health expenditure. The USA have a similar age structure and also a high total population. However, they could have been classified as cluster 4, too, as they are also near that.
Cluster 2: This cluster contains many countries that we would naturally classify as third world countries. Among them are many countries in Africa, South-East Asia and South and Middle America. They all have a lower GDP, compared to states in cluster 4, and resemble each other in terms of health expenditure, age structure or birth rate. Genuinely, I would classify them as poorer countries with less infrastructure.
Cluster 3: In this cluster many small countries and island states are grouped together. They all have their population in common and, unfortunately, many values that are zero. Among them are countries like South Sudan with little to no infrastructure, thus containing formerly missing values as 0. Nevertheless, to me it made sense to group these together, as they resemble each other, not only in the zero values, but in height of GDP and total population.
Cluster 4: Many of the countries in this cluster would genuinely be described as first world countries. Almost every european country has been grouped in this cluster, as well as Canada, Australia and New Zealand. They resemble each other in their infrastructure (e.g. train passengers and cars, mobile phone and internet users), age structure or child mortality. Most of them have a high urban population. I would describe them as richer, urban countries with a good infrastructure.
There are two more things that can be observed, when looking at the cluster distribution on a map. First of all, some countries change their cluster over time. The reasons for that can be a better or worse infrastructure and so on, but can’t be determined here in detail. The second thing that catches your eye is, that many countries that are geographically near each other share the same cluster.
To look at the clusters on a world map, there first need to be some adjustments in the names of countries, to make it easier to join them to their geodata. Then the geodata is loaded, joined with the data and visualized in a map. Unfortunately there isn’t geodata available in R for some very small countries, that thus aren’t visualized here.
df$Country <- gsub("Antigua and Barbuda", "Antigua", df$Country)
df$Country <- gsub("Brunei Darussalam", "Brunei", df$Country)
df$Country <- gsub("Congo, Dem. Rep.", "Democratic Republic of the Congo", df$Country)
df$Country <- gsub("Congo, Rep.", "Republic of Congo", df$Country)
df$Country <- gsub("Cote d'Ivoire", "Ivory Coast", df$Country)
df$Country <- gsub("Egypt, Arab Rep.", "Egypt", df$Country)
df$Country <- gsub("Gambia, The", "Gambia", df$Country)
df$Country <- gsub("Hong Kong SAR, China", "China:Hong Kong", df$Country)
df$Country <- gsub("Iran, Islamic Rep.", "Iran", df$Country)
df$Country <- gsub("Korea, Dem. Rep.", "North Korea", df$Country)
df$Country <- gsub("Korea, Rep.", "South Korea", df$Country)
df$Country <- gsub("Kyrgyz Republic", "Kyrgyzstan", df$Country)
df$Country <- gsub("Lao PDR", "Laos", df$Country)
df$Country <- gsub("Macao SAR, China", "China:Macao", df$Country)
df$Country <- gsub("Macedonia, FYR", "Macedonia", df$Country)
df$Country <- gsub("Russian Federation", "Russia", df$Country)
df$Country <- gsub("Sint Maarten [(]Dutch part[)]", "Sint Maarten", df$Country)
df$Country <- gsub("Slovak Republic", "Slovakia", df$Country)
df$Country <- gsub("St. Lucia", "Saint Lucia", df$Country)
df$Country <- gsub("St. Martin [(]French part[)]", "Saint Martin", df$Country)
df$Country <- gsub("St. Vincent and the Grenadines", "Saint Vincent", df$Country)
df$Country <- gsub("Syrian Arab Republic", "Syria", df$Country)
df$Country <- gsub("United Kingdom", "UK", df$Country)
df$Country <- gsub("United States", "USA", df$Country)
df$Country <- gsub("Venezuela, RB", "Venezuela", df$Country)
df$Country <- gsub("Yemen, Rep.", "Yemen", df$Country)
df$Country <- gsub("Bahamas, The", "Bahamas", df$Country)
df$Country <- gsub("Faeroe Islands", "Faroe Islands", df$Country)
df$Country <- gsub("Micronesia, Fed. Sts.", "Micronesia", df$Country)
df$Country <- gsub("St. Kitts and Nevis", "Saint Kitts", df$Country)
df$Country <- gsub("Trinidad and Tobago", "Trinidad", df$Country)
df$Country <- gsub("Virgin Islands [(]U.S.[])]", "Virgin Islands, US", df$Country)
df$Country <- gsub("West Bank and Gaza", "Palestine", df$Country)
sf_map <- st_as_sf(map("world", plot = F, fill = T))
sf_map <- st_make_valid(sf_map)
df <- full_join(df, sf_map, by = c("Country" = "ID"))
df <- st_as_sf(df)
p20 <- df %>%
filter(year(Date) == 2000) %>%
ggplot(aes(fill = cluster)) +
geom_sf() +
scale_fill_rickandmorty() +
labs(title = "2000")
p21 <- df %>%
filter(year(Date) == 2002) %>%
ggplot(aes(fill = cluster)) +
geom_sf() +
scale_fill_rickandmorty() +
labs(title = "2002")
p22 <- df %>%
filter(year(Date) == 2004) %>%
ggplot(aes(fill = cluster)) +
geom_sf() +
scale_fill_rickandmorty() +
labs(title = "2004")
p23 <- df %>%
filter(year(Date) == 2006) %>%
ggplot(aes(fill = cluster)) +
geom_sf() +
scale_fill_rickandmorty() +
labs(title = "2006")
p24 <- df %>%
filter(year(Date) == 2008) %>%
ggplot(aes(fill = cluster)) +
geom_sf() +
scale_fill_rickandmorty() +
labs(title = "2008")
p25 <- df %>%
filter(year(Date) == 2010) %>%
ggplot(aes(fill = cluster)) +
geom_sf() +
scale_fill_rickandmorty() +
labs(title = "2010")
grid.arrange(p20, p21, p22, p23, p24, p25, nrow = 3, ncol = 2)Linear Regression Analysis
Now that the countries have been clustered into similar groups, I want to know if the internal structure of the clusters, leading to the total GDP in US Dollar differs from each other. To get a first glimpse at the structure I assume that the relationship between the dependent variable GDP and the other independet variables is linear. The regression analysis will show if it’s true or if other models or algorithms would fit the data better.
The overall data frame df is split up into its clusters before performing regression analysis, so the code is easier to read.
cluster_1 <- df %>% filter(cluster == 1)
cluster_2 <- df %>% filter(cluster == 2)
cluster_3 <- df %>% filter(cluster == 3)
cluster_4 <- df %>% filter(cluster == 4)The overall regression model is set in the next step. I assumed that the total GDP in US Dollar is linearly dependent on all other variables, except for the GDP per capita.
model <- as.formula(Total_GDP_In_USD ~ Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population +
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population +
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population +
Total_Life_Expectancy + Life_Expectancy_Men + Life_Expectancy_Women +
Birth_Rate_Per_1000_People + Urban_Population + Total_Population +
Total_Health_Expenditure_In_Percent_Of_GDP +
Health_Expenditure_Per_Capita_In_USD +
Mortality_Under_5_Per_1000_Live_Births + Internet_Users_Per_100_People +
Mobile_Phone_Subscribers + Cars_Per_1000_People + Train_Mio_Passenger_KM)Now, the regression analysis is performed for each cluster.
Call:
lm(formula = model, data = cluster_1)
Residuals:
Min 1Q Median 3Q Max
-2.839e+11 -7.623e+10 8.758e+09 6.676e+10 1.901e+11
Coefficients:
Estimate
(Intercept) 8.153e+12
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population -1.964e+11
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population -2.875e+11
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population -1.676e+11
Total_Life_Expectancy -8.466e+10
Life_Expectancy_Men 1.013e+11
Life_Expectancy_Women 2.155e+11
Birth_Rate_Per_1000_People 1.953e+11
Urban_Population 2.061e+04
Total_Population -7.723e+03
Total_Health_Expenditure_In_Percent_Of_GDP -3.944e+11
Health_Expenditure_Per_Capita_In_USD 1.158e+09
Mortality_Under_5_Per_1000_Live_Births 2.258e+09
Internet_Users_Per_100_People 6.718e+10
Mobile_Phone_Subscribers -5.848e+02
Std. Error
(Intercept) 1.471e+13
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 1.420e+11
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population 1.240e+11
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 1.285e+11
Total_Life_Expectancy 1.199e+11
Life_Expectancy_Men 1.166e+11
Life_Expectancy_Women 1.553e+11
Birth_Rate_Per_1000_People 1.091e+11
Urban_Population 6.242e+03
Total_Population 2.117e+03
Total_Health_Expenditure_In_Percent_Of_GDP 1.204e+11
Health_Expenditure_Per_Capita_In_USD 1.839e+08
Mortality_Under_5_Per_1000_Live_Births 4.432e+10
Internet_Users_Per_100_People 1.250e+10
Mobile_Phone_Subscribers 9.226e+02
t value
(Intercept) 0.554
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population -1.384
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population -2.319
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population -1.304
Total_Life_Expectancy -0.706
Life_Expectancy_Men 0.868
Life_Expectancy_Women 1.387
Birth_Rate_Per_1000_People 1.791
Urban_Population 3.302
Total_Population -3.648
Total_Health_Expenditure_In_Percent_Of_GDP -3.275
Health_Expenditure_Per_Capita_In_USD 6.297
Mortality_Under_5_Per_1000_Live_Births 0.051
Internet_Users_Per_100_People 5.375
Mobile_Phone_Subscribers -0.634
Pr(>|t|)
(Intercept) 0.58710
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 0.18544
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population 0.03396 *
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 0.21056
Total_Life_Expectancy 0.49017
Life_Expectancy_Men 0.39796
Life_Expectancy_Women 0.18438
Birth_Rate_Per_1000_People 0.09224 .
Urban_Population 0.00450 **
Total_Population 0.00217 **
Total_Health_Expenditure_In_Percent_Of_GDP 0.00476 **
Health_Expenditure_Per_Capita_In_USD 1.06e-05 ***
Mortality_Under_5_Per_1000_Live_Births 0.96000
Internet_Users_Per_100_People 6.19e-05 ***
Mobile_Phone_Subscribers 0.53510
[ erreichte getOption("max.print") -- 2 Zeilen ausgelassen ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.589e+11 on 16 degrees of freedom
Multiple R-squared: 0.9995, Adjusted R-squared: 0.9991
F-statistic: 2184 on 16 and 16 DF, p-value: < 2.2e-16
We can see that not all variables have an influence on the GDP that is not random. In this cluster it is the share of people between 15 to 64 years in age, the urban and total population, the health expenditure in total in percent of GDP and per capita in US Dollar, as well as the number of internet users per 100 inhabitants. Our R-squared statistics is very high with over 99%, which means through this model, over 99% of variance is explained. If we would perfrom the regression again with the significant variables only, the R-squared would be a little smaller, as the random influence from all other variables, that may explain some variance, would be dropped, but it would stay at this high level.
To see if the model really fits and no other problem may be underlying, I visually inspect the residuals of this model.
At first glance most things seem to be fine. The residuals vs. fitted plot shows that there may be a slightly non-linear influence, but overall the red line should be vertical, which is the case. This plot, as well as the scale-location plot can show heteroskedasticity. A linear regression assumes homoskedasticity, so if the red line is not vertical it indicates heteroskedasticity. Here we have an almost vertical line, but it shows a slight curve. It could be tried to transform the dependent variable (log or squareroot), so that this plot only shows homoskedasticity, but for now I am fine with it. The normal Q-Q plot indicates that all residuals are normally distributed, as they all gather around the drawn line. The last plot, residuals vs. leverage, shows that there aren’t any outliers with a strong influence on the regression line. If there would be strong outliers, they would be located outside of the cook’s distance lines.
Call:
lm(formula = model, data = cluster_2)
Residuals:
Min 1Q Median 3Q Max
-3.028e+11 -1.181e+10 9.333e+08 8.454e+09 4.704e+11
Coefficients:
Estimate
(Intercept) -4.579e+10
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 1.016e+09
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population -6.390e+07
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 1.387e+08
Total_Life_Expectancy 2.305e+09
Life_Expectancy_Men 6.656e+08
Life_Expectancy_Women -2.513e+09
Birth_Rate_Per_1000_People 2.912e+08
Urban_Population 5.617e+03
Total_Population -1.405e+03
Total_Health_Expenditure_In_Percent_Of_GDP -1.295e+09
Health_Expenditure_Per_Capita_In_USD 1.143e+08
Mortality_Under_5_Per_1000_Live_Births 6.201e+07
Internet_Users_Per_100_People 2.219e+07
Mobile_Phone_Subscribers 1.489e+03
Std. Error
(Intercept) 2.497e+10
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 9.580e+08
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population 2.050e+08
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 3.604e+08
Total_Life_Expectancy 3.160e+09
Life_Expectancy_Men 1.752e+09
Life_Expectancy_Women 1.679e+09
Birth_Rate_Per_1000_People 4.093e+08
Urban_Population 1.765e+02
Total_Population 9.359e+01
Total_Health_Expenditure_In_Percent_Of_GDP 4.772e+08
Health_Expenditure_Per_Capita_In_USD 9.934e+06
Mortality_Under_5_Per_1000_Live_Births 6.726e+07
Internet_Users_Per_100_People 1.560e+08
Mobile_Phone_Subscribers 1.095e+02
t value
(Intercept) -1.834
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 1.061
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population -0.312
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 0.385
Total_Life_Expectancy 0.730
Life_Expectancy_Men 0.380
Life_Expectancy_Women -1.497
Birth_Rate_Per_1000_People 0.711
Urban_Population 31.824
Total_Population -15.009
Total_Health_Expenditure_In_Percent_Of_GDP -2.713
Health_Expenditure_Per_Capita_In_USD 11.503
Mortality_Under_5_Per_1000_Live_Births 0.922
Internet_Users_Per_100_People 0.142
Mobile_Phone_Subscribers 13.595
Pr(>|t|)
(Intercept) 0.06686 .
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 0.28901
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population 0.75533
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 0.70046
Total_Life_Expectancy 0.46580
Life_Expectancy_Men 0.70401
Life_Expectancy_Women 0.13465
Birth_Rate_Per_1000_People 0.47695
Urban_Population < 2e-16 ***
Total_Population < 2e-16 ***
Total_Health_Expenditure_In_Percent_Of_GDP 0.00675 **
Health_Expenditure_Per_Capita_In_USD < 2e-16 ***
Mortality_Under_5_Per_1000_Live_Births 0.35673
Internet_Users_Per_100_People 0.88692
Mobile_Phone_Subscribers < 2e-16 ***
[ erreichte getOption("max.print") -- 2 Zeilen ausgelassen ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.196e+10 on 1373 degrees of freedom
Multiple R-squared: 0.7916, Adjusted R-squared: 0.7892
F-statistic: 326 on 16 and 1373 DF, p-value: < 2.2e-16
Now that we look at the summary of the regression analysis for the second cluster, we can see that here other variables have a statistically significant influence on the GDP. Same as for cluster 1, the urban and total population and the health expenditures in total and per capita have a non random influence. Different from there, the number of mobile phone subscribers and the number of cars per 1000 people seems to have a significant influence on the GDP. Unlike before, the adjusted R-squared measure for model accuracy is way lower. Only 78,92% of variance is explained. Generally it is said to accept models with a R-squared above 80%, so something doesn’t seem to fit here. To inspect this further, we look again at the plots for inspecting the residuals.
The residuals vs. fitted plot shows a nice vertical line, indicating that the relationship between our dependent variable GDP and the independent variables is a linear one, just as we assumed. The scale-location plot on the other hand implies that the variance is not constant, therefore heteroskedastic. To change that a logarithmic transformation of the dependent variable can be performed. The quantile-quantile plot shows something unwanted, too. The points at the beginning and end of the plot are too far away from the line to assume a normal distribution of the residuals, which is desired. In the residuals vs. leverage plot, a strong outlier outside the cook’s distance lines can be detected. The regression analysis should be run again, with the significant independent variables only, a transformed dependent variable and without the outlier or with a smoothed outlier (e.g. via winsorization), to see whether that changes the r-squared to a higher one and achieves the desired attributes, such as homoskedasticity.
Call:
lm(formula = model, data = cluster_3)
Residuals:
Min 1Q Median 3Q Max
-3.533e+09 -3.801e+08 -2.091e+08 2.806e+08 3.366e+09
Coefficients: (7 not defined because of singularities)
Estimate
(Intercept) 2.934e+08
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population NA
Total_Life_Expectancy NA
Life_Expectancy_Men NA
Life_Expectancy_Women NA
Birth_Rate_Per_1000_People 2.460e+07
Urban_Population 3.592e+03
Total_Population -3.638e+01
Total_Health_Expenditure_In_Percent_Of_GDP -7.574e+07
Health_Expenditure_Per_Capita_In_USD 5.008e+05
Mortality_Under_5_Per_1000_Live_Births 7.345e+06
Internet_Users_Per_100_People 3.522e+07
Mobile_Phone_Subscribers -1.797e+04
Std. Error
(Intercept) 1.296e+08
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population NA
Total_Life_Expectancy NA
Life_Expectancy_Men NA
Life_Expectancy_Women NA
Birth_Rate_Per_1000_People 1.081e+07
Urban_Population 3.206e+03
Total_Population 4.057e+01
Total_Health_Expenditure_In_Percent_Of_GDP 2.742e+07
Health_Expenditure_Per_Capita_In_USD 7.696e+04
Mortality_Under_5_Per_1000_Live_Births 9.257e+06
Internet_Users_Per_100_People 4.571e+06
Mobile_Phone_Subscribers 3.604e+03
t value
(Intercept) 2.264
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population NA
Total_Life_Expectancy NA
Life_Expectancy_Men NA
Life_Expectancy_Women NA
Birth_Rate_Per_1000_People 2.277
Urban_Population 1.120
Total_Population -0.897
Total_Health_Expenditure_In_Percent_Of_GDP -2.762
Health_Expenditure_Per_Capita_In_USD 6.507
Mortality_Under_5_Per_1000_Live_Births 0.793
Internet_Users_Per_100_People 7.705
Mobile_Phone_Subscribers -4.985
Pr(>|t|)
(Intercept) 0.02464 *
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population NA
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population NA
Total_Life_Expectancy NA
Life_Expectancy_Men NA
Life_Expectancy_Women NA
Birth_Rate_Per_1000_People 0.02386 *
Urban_Population 0.26384
Total_Population 0.37093
Total_Health_Expenditure_In_Percent_Of_GDP 0.00627 **
Health_Expenditure_Per_Capita_In_USD 5.92e-10 ***
Mortality_Under_5_Per_1000_Live_Births 0.42847
Internet_Users_Per_100_People 5.77e-13 ***
Mobile_Phone_Subscribers 1.33e-06 ***
[ erreichte getOption("max.print") -- 2 Zeilen ausgelassen ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 927300000 on 202 degrees of freedom
Multiple R-squared: 0.5865, Adjusted R-squared: 0.5681
F-statistic: 31.83 on 9 and 202 DF, p-value: < 2.2e-16
The first thing that catches the eye here are the many missing values. As mentioned in the beginning, this is due to my choice of replacing missing values with a 0. In this cluster many very small countries are represented that weren’t able to collect the desired data, leading to variables that are completely missing. Nevertheless, there are some variables that have a significant influence on the GDP. As seen in the clusters before, these are the total and per capita health expenditures, the number of internet user per 100 people, the number of mobile phone users and the number of cars per 1000 people. Unlike before, the birth rate per 1000 people seems to play a role here. Unfortunately, the r-squared statistic looks even worse here, with only explaining 56,81% of the variance. It can be concluded that this model doesn’t fit the data, which might be because of the many zero values. Maybe replacing missing values with the 25%-percentile would have been a better option, as the median lead to a too high distortion, mentioned in the beginning.
Nevertheless, I want to inspect the residuals visually.
The residuals vs. fitted plot indicates a linearity, as the red line is nearly vertical. But as before, the next plot (scale-location) indicates heteroskedasticity and the quantile-quantile plot implies that the residuals are not distributed normally, as they aren’t near ne line, except for the middle. Luckily no strong outliers can be detected using the cook’s distance in the residuals vs. leverage plot. To achieve better results, resp. a more accurate model and the desired attributes of homoskedasticity and normally distributed residuals, I would recommend to run the regression analysis again, using the significant variables only and transforming the dependent variable (logarithmic or squareroot) beforehand. Changing the replacement of missing values from zero to the 25%-percentile must have been done before doing the clustering so that all missing values are replaced similarly, may leading to different cluster assignments. It could be tried if there would be better clusters, but to only influence the regression analysis of this current cluster it is not a suitable method here.
Call:
lm(formula = model, data = cluster_4)
Residuals:
Min 1Q Median 3Q Max
-1.795e+12 -1.435e+11 -3.330e+10 1.247e+11 1.302e+12
Coefficients:
Estimate
(Intercept) -3.674e+12
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 2.771e+10
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population -3.828e+09
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 5.289e+09
Total_Life_Expectancy 3.338e+10
Life_Expectancy_Men 6.136e+10
Life_Expectancy_Women -4.602e+10
Birth_Rate_Per_1000_People -6.341e+09
Urban_Population 1.509e+04
Total_Population -8.324e+03
Total_Health_Expenditure_In_Percent_Of_GDP 5.584e+09
Health_Expenditure_Per_Capita_In_USD -2.116e+07
Mortality_Under_5_Per_1000_Live_Births 1.088e+10
Internet_Users_Per_100_People 2.523e+09
Mobile_Phone_Subscribers 6.283e+03
Std. Error
(Intercept) 5.182e+11
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 4.460e+09
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population 1.462e+09
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 4.766e+09
Total_Life_Expectancy 3.191e+10
Life_Expectancy_Men 1.815e+10
Life_Expectancy_Women 1.773e+10
Birth_Rate_Per_1000_People 6.033e+09
Urban_Population 4.389e+03
Total_Population 3.742e+03
Total_Health_Expenditure_In_Percent_Of_GDP 6.119e+09
Health_Expenditure_Per_Capita_In_USD 1.219e+07
Mortality_Under_5_Per_1000_Live_Births 3.407e+09
Internet_Users_Per_100_People 7.014e+08
Mobile_Phone_Subscribers 9.174e+02
t value
(Intercept) -7.090
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 6.214
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population -2.618
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 1.110
Total_Life_Expectancy 1.046
Life_Expectancy_Men 3.380
Life_Expectancy_Women -2.595
Birth_Rate_Per_1000_People -1.051
Urban_Population 3.438
Total_Population -2.224
Total_Health_Expenditure_In_Percent_Of_GDP 0.913
Health_Expenditure_Per_Capita_In_USD -1.735
Mortality_Under_5_Per_1000_Live_Births 3.195
Internet_Users_Per_100_People 3.597
Mobile_Phone_Subscribers 6.849
Pr(>|t|)
(Intercept) 3.28e-12 ***
Share_Of_People_Aged_65plus_In_Percent_Of_Total_Population 8.84e-10 ***
Share_Of_People_Aged_15_To_64_In_Percent_Of_Total_Population 0.009040 **
Share_Of_People_Aged_0_To_14_In_Percent_Of_Total_Population 0.267425
Total_Life_Expectancy 0.295811
Life_Expectancy_Men 0.000765 ***
Life_Expectancy_Women 0.009653 **
Birth_Rate_Per_1000_People 0.293630
Urban_Population 0.000621 ***
Total_Population 0.026442 *
Total_Health_Expenditure_In_Percent_Of_GDP 0.361753
Health_Expenditure_Per_Capita_In_USD 0.083104 .
Mortality_Under_5_Per_1000_Live_Births 0.001463 **
Internet_Users_Per_100_People 0.000344 ***
Mobile_Phone_Subscribers 1.63e-11 ***
[ erreichte getOption("max.print") -- 2 Zeilen ausgelassen ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.055e+11 on 702 degrees of freedom
Multiple R-squared: 0.8515, Adjusted R-squared: 0.8481
F-statistic: 251.5 on 16 and 702 DF, p-value: < 2.2e-16
The structure of this cluster differs from the others. As seen before, the urban and total population, the internet users per 100 people and the number of mobile phone subscribers have a significant influence on the GDP. Different from the analyses before, the share of people over 65 and between 15 and 64, as well as the life expectancy of men and women, the child mortality and the million passenger-km have an influence on the height of the GDP. It is also good to see that the adjusted R-squared measure has a value above 80% (84,81%), indicating that this is a well fitting model to describe the underlying data.
To see if the other assumptions from the linear regression are fulfilled, too, we look again at the residuals visually.
The residuals vs. fitted plot indicates linearity, as in the cases before, but not that clearly. And as before the scale-location plot indicates heteroskedasticity. The quantile-quantile plot however looks a bit better in my opinion than in the two cases before. Only a few data points aren’t near the line and the rest matches exactly. The residuals vs. leverage plot looks good, too, as no strong outliers are detected and the cook’s distance line is almost invisible. To get rid of the indicated heteroskedasticity, I would recommend to perform the regression analysis again with a transformed dependent variable.
Conclusion
The clusters determined by the k-means clustering algorithm using k = 4 gives appropriate results in my opinion. To achieve even better results it could be performed using the 25%-percentile or any other better fitting percentile as a replacement for missing values than zero. It hasn’t been done here, as I tried out some other replacements like the mean and median that weren’t suitable and therefore chose to simply replace all missing values with zero.
In the regression part the different replacement would have made a difference for the third cluster, as it contained many missing values beforehand. But then again, this cluster won’t exist as it is now. Other than that, a different structure was identified in different clusters leading to the total GDP in US Dollar. Despite the differences that are sometimes bigger, sometimes smaller, there were a few similarities, like the population, that had a significant influence in all clusters. The linear regression is a fitting model here in my opinion, as the visual inspection of residuals indicated linearity in all clusters. The heteroskedasticity implied in some clusters could be changed by transforming the dependent variable.